fix: union all by name#15603
Conversation
alamb
left a comment
There was a problem hiding this comment.
Thank you for this @chenkovsky (and all the other PRs recently -- very much appreciated)
| #[pin] | ||
| stream: S, | ||
|
|
||
| transform_schema: bool, |
There was a problem hiding this comment.
this seems like this is fixing the symptom rather than the root cause
I think it would be better to have the correct schema reflected in the plan in the first place 🤔
There was a problem hiding this comment.
yes, correct nullability in schema is better. I tried to fix logical plan before.
But nullability in logical plan won't affect physical plan. it's ignored.
in physical plan, it will recompute nullaibility from bottom to top.
but in this scenario, it seems that we need to pass nullability from top to bottom.
I need more suggestions.
There was a problem hiding this comment.
I want to learn some experience from spark.
for logical plan, I haven't found any logic to handle this problem.
for physical plan, spark is much easier, its InternalRow is schemaless. so it will use the schema of physical plan by default. but recordbatch contains schema.
I'm not 100% sure, I think current logical plan and physical plan schema is correct. the root cause is that recordbatch's schema doesn't match physical plan's. so adding an adapter is a proper way.
| let ret = this.stream.poll_next(cx); | ||
| if transform_schema { | ||
| if let Poll::Ready(Some(Ok(batch))) = ret { | ||
| return Poll::Ready(Some(batch.with_schema(schema).map_err(|e| { |
There was a problem hiding this comment.
I think this is one of the notorious problems when logical schema doesn't match the physical one on nullability/metadata. But this change might bring a performance impact, although the schema change is just reassigning the value but it also calls schema_contains which may be expensive
There was a problem hiding this comment.
yes, there's performance concern now. if this approach is feasible, I can try to optimize it, maybe use RecordBatch::try_new_with_options
|
Thanks for looking into the nullable issue, it's been on my plate for a bit to look into some more. It's really the last blocker I know of for union by name to work correctly. |
|
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
nullable: trueif any of the branches produces fields withnullable: trueon the same position #15394.Rationale for this change
schema from inner physical plan is returned.
What changes are included in this PR?
update UnionExec and RecordBatchStreamAdapter to transform schema.
I found that logical plan's nullability for Projection is also not correct after optimization,
But this won't make the test fail. So I haven't included this part in this PR. Do we need to correct logical plan?
Are these changes tested?
UT
Are there any user-facing changes?
No